White Wine by Khem Veasna

We will embark on a study of white wine data.

The data was obtained from https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv

Let’s take a look at the variables.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

How about their types?

str(wine)
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

We see that quality is stored as an integer. This will be a problem because it is really a discrete rating, and we’ll address this below.

Univariate Analysis

What is the structure of your dataset?

summary(wine) 
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

We can see the distribution of the quality of wine with this histogram.

hist(as.numeric(wine$quality))

Let’s look at some features. What might influence the quality of wine?

aggregate(sulphates ~ quality, wine, mean)
##   quality sulphates
## 1       3 0.4745000
## 2       4 0.4761350
## 3       5 0.4822032
## 4       6 0.4911056
## 5       7 0.5031023
## 6       8 0.4862286
## 7       9 0.4660000
aggregate(alcohol ~ quality, wine, mean)
##   quality  alcohol
## 1       3 10.34500
## 2       4 10.15245
## 3       5  9.80884
## 4       6 10.57537
## 5       7 11.36794
## 6       8 11.63600
## 7       9 12.18000

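So we see the mean sulphates and alcohol content for each quality level. Mean alcohol rises steadily from 9.81 at quality 5 to 12.18 at quality 9, while mean sulphates stays between about 0.47 and 0.50 at every level.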
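One way to compare the two tables is to express the spread of the group means in units of each feature's overall standard deviation. The sketch below retypes the means printed above and the standard deviations that sapply(wine, sd) reports in this document. The variable names are illustrative, and the measure ignores how many wines fall in each quality level, so treat it as a rough guide only.

```r
# Mean values per quality level (3 to 9), copied from the aggregate() output above
sulphates_means <- c(0.4745000, 0.4761350, 0.4822032, 0.4911056,
                     0.5031023, 0.4862286, 0.4660000)
alcohol_means   <- c(10.34500, 10.15245, 9.80884, 10.57537,
                     11.36794, 11.63600, 12.18000)

# Overall standard deviations, copied from the sapply(wine, sd) output
overall_sd <- c(sulphates = 0.1141258, alcohol = 1.230621)

# Range of the group means, measured in overall standard deviations
spread_in_sd <- c(
  sulphates = diff(range(sulphates_means)) / overall_sd[["sulphates"]],
  alcohol   = diff(range(alcohol_means)) / overall_sd[["alcohol"]]
)
print(round(spread_in_sd, 2))
```

On these numbers the alcohol means span roughly 1.9 standard deviations while the sulphates means span about 0.3, which suggests alcohol separates the quality levels more clearly than sulphates.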

Which feature has the most variation?

sapply(wine, sd, na.rm=TRUE)
##                    X        fixed.acidity     volatile.acidity 
##         1.414075e+03         8.438682e-01         1.007945e-01 
##          citric.acid       residual.sugar            chlorides 
##         1.210198e-01         5.072058e+00         2.184797e-02 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##         1.700714e+01         4.249806e+01         2.990907e-03 
##                   pH            sulphates              alcohol 
##         1.510006e-01         1.141258e-01         1.230621e+00 
##              quality 
##         8.856386e-01

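Ignoring X, which is just the row index, total.sulfur.dioxide has the largest standard deviation (about 42.5), and we’ll dig into this further below. Keep in mind that the features are measured on different scales, so raw standard deviations are not directly comparable.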
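Because the features sit on very different scales, a scale-free measure such as the coefficient of variation (standard deviation divided by mean) may give a fairer ranking. The sketch below retypes the standard deviations printed above and the means from summary(wine); the vector names are illustrative, and X and quality are left out on purpose.

```r
# Standard deviations copied from the sapply(wine, sd) output
feature_sd <- c(
  fixed.acidity = 8.438682e-01, volatile.acidity = 1.007945e-01,
  citric.acid = 1.210198e-01, residual.sugar = 5.072058e+00,
  chlorides = 2.184797e-02, free.sulfur.dioxide = 1.700714e+01,
  total.sulfur.dioxide = 4.249806e+01, density = 2.990907e-03,
  pH = 1.510006e-01, sulphates = 1.141258e-01, alcohol = 1.230621e+00
)

# Means copied from the summary(wine) output
feature_mean <- c(
  fixed.acidity = 6.855, volatile.acidity = 0.2782,
  citric.acid = 0.3342, residual.sugar = 6.391,
  chlorides = 0.04577, free.sulfur.dioxide = 35.31,
  total.sulfur.dioxide = 138.4, density = 0.9940,
  pH = 3.188, sulphates = 0.4898, alcohol = 10.51
)

# Coefficient of variation, matched by name, largest first
cv <- feature_sd / feature_mean[names(feature_sd)]
cv_sorted <- sort(cv, decreasing = TRUE)
print(round(cv_sorted, 3))
```

By this measure residual.sugar varies the most relative to its mean (about 0.79), while total.sulfur.dioxide comes in at about 0.31. With the data frame loaded, the same quantity could be computed directly, for example with sapply(wine[-1], function(v) sd(v) / mean(v)) while quality is still numeric.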

What is/are the main feature(s) of interest in your dataset?

The main interest is to see which feature (or combination of features) in the dataset affects wine quality the most.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the description of the features, those that affect taste are: volatile acidity, citric acid, residual sugar, chlorides, and total sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The problem here is that the feature of interest, “quality”, should probably be a category. Its values are discrete scores on a scale of 0, 1, 2, …, 10, with 10 being very good quality and 0 being bad quality. In this dataset the observed scores range from 3 to 9.

We will change the ‘quality’ feature into a category below as part of the preprocessing step.

# Process the data so that quality is a category

wine$quality <- factor(wine$quality)
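One consequence of this conversion is worth keeping in mind: calling as.numeric() on a factor returns the internal level codes, not the original scores. The sketch below uses a handful of made-up scores in the observed 3 to 9 range to show the difference, and also shows an ordered factor as one possible alternative (an assumption on my part, not something the analysis here relies on).

```r
# A few quality scores in the observed 3 to 9 range (illustrative values)
scores <- c(6, 6, 3, 9, 5)
quality_factor <- factor(scores)

# as.numeric() on a factor returns the internal level codes, not the scores
codes <- as.numeric(quality_factor)

# Converting through character recovers the original scores
recovered <- as.numeric(as.character(quality_factor))

# An ordered factor keeps the ranking, so comparisons such as >= work
quality_ordered <- factor(scores, levels = 3:9, ordered = TRUE)
high_quality <- quality_ordered >= "8"

print(codes)
print(recovered)
print(high_quality)
```

Here the levels are 3, 5, 6 and 9, so the codes come out as 3 3 1 4 2 while the recovered values match the original scores. Any later plot that needs quality on a numeric axis should therefore go through as.character() first.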

Bivariate Plots Section

This is the result of ggpairs on the wine data.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In the last row of the plot matrix, we can see that for high-quality wines (quality >= 8), the sensory features are present in low quantities. The sensory features are: volatile acidity, citric acid, residual sugar, chlorides, and total sulfur dioxide.

Let’s look into this a bit more, starting with total sulfur dioxide.

The histograms below suggest that higher-quality wines contain less total sulfur dioxide.

ggplot(aes(x = total.sulfur.dioxide), data = wine) + geom_histogram() + facet_wrap(~quality)

ggplot(aes(x = alcohol), data = wine) + geom_histogram(binwidth = 0.1) + facet_wrap(~quality)

# sulphates/alcohol cannot exceed about 0.14 (max sulphates 1.08 / min alcohol 8), so a binwidth of 0.1 would collapse the histogram into one or two bars
ggplot(aes(x = sulphates/alcohol), data = wine) + geom_histogram(binwidth = 0.005) + facet_wrap(~quality)

ggplot(aes(x = citric.acid), data = wine) + geom_histogram(binwidth = 0.1) + facet_wrap(~quality)
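The quality groups differ greatly in size (from 5 wines to 2198, as the group counts in this report show), so raw counts in faceted histograms are hard to compare across panels. A possible remedy is to compare the share of each group that falls in each bin. The sketch below does this in base R on simulated values; the group means and spreads are made up for illustration and are not taken from the wine data, and bin_share is a hypothetical helper.

```r
set.seed(7)
# Simulated stand-ins for a large and a small quality group; the means and
# spreads are made up for illustration and are not taken from the wine data
large_group <- rnorm(2198, mean = 140, sd = 40)
small_group <- rnorm(20, mean = 170, sd = 40)

# Common bins wide enough to cover both groups
breaks <- seq(-200, 500, by = 50)

# Share of a group falling in each bin, so the shares sum to 1
bin_share <- function(values, breaks) {
  prop.table(table(cut(values, breaks = breaks)))
}
large_share <- bin_share(large_group, breaks)
small_share <- bin_share(small_group, breaks)

# Both rows are now on the same 0 to 1 scale despite the size difference
print(round(rbind(large_share, small_share), 2))
```

In ggplot2 a similar effect can be had by mapping y to after_stat(density) in geom_histogram(), or by passing scales = "free_y" to facet_wrap(), so that the small groups are not flattened by the large ones.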

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

From the ggpairs plot, we can see that some variables have interesting relationships with each other. For example: (1) citric acid increases as fixed acidity increases; (2) pH decreases as fixed acidity increases; (3) alcohol decreases as density increases; (4) density increases as chlorides increase.
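Relationships like these can also be ranked numerically rather than read off the plot matrix. The following is a minimal sketch of a helper, here called rank_correlations (a hypothetical name), that lists every pair of numeric columns by absolute Pearson correlation. It is demonstrated on a small simulated data frame with made-up columns a, b and c, not on the wine data.

```r
# Rank every pair of numeric columns by absolute Pearson correlation
rank_correlations <- function(df) {
  num <- df[sapply(df, is.numeric)]
  cm <- cor(num, use = "pairwise.complete.obs")
  # Indices of the upper triangle, so each pair appears once
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Simulated example: b depends strongly and negatively on a, c is pure noise
set.seed(1)
toy <- data.frame(a = 1:20)
toy$b <- -2 * toy$a + rnorm(20, sd = 0.5)
toy$c <- rnorm(20)

ranked <- rank_correlations(toy)
print(ranked)
```

With the wine data loaded, a call such as rank_correlations(wine[-1]) would skip the X index column. Note that quality drops out of the ranking once it has been converted to a factor, since only numeric columns are kept.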

Let’s take another look at total.sulfur.dioxide.

quality_group <- group_by(wine, quality)
summarize(quality_group, total_sulphates = mean(total.sulfur.dioxide),n = n())
## Source: local data frame [7 x 3]
## 
##   quality total_sulphates    n
## 1       3        170.6000   20
## 2       4        125.2791  163
## 3       5        150.9046 1457
## 4       6        137.0473 2198
## 5       7        125.1148  880
## 6       8        126.1657  175
## 7       9        116.0000    5
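The group sizes matter when reading these means: quality 3 and quality 9 contain only 20 and 5 wines. A rough way to gauge how noisy each mean is would be the standard error, sd / sqrt(n). Per-group standard deviations are not shown in this report, so the sketch below substitutes the overall standard deviation of total.sulfur.dioxide for every group, which is an assumption.

```r
# Group sizes per quality level (3 to 9), copied from the summary above
n <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
       `7` = 880, `8` = 175, `9` = 5)

# Overall standard deviation of total.sulfur.dioxide from sapply(wine, sd);
# using it for every group is an assumption, since per-group values are not shown
overall_sd <- 42.49806

# Approximate standard error of each group mean
approx_se <- overall_sd / sqrt(n)
print(round(approx_se, 1))
```

Under that assumption the standard error is about 19 for quality 9 and about 9.5 for quality 3, against roughly 0.9 for quality 6, so the means at the two extremes should be read with caution.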

What was the strongest relationship you found?

The strongest relationship I found is between quality and total.sulfur.dioxide.

Multivariate Plots Section

Multivariate Analysis

Total sulfur dioxide and citric acid appear to relate to quality in a similar way, moving in the same direction as quality changes.

This is seen from the ggpairs plot above.

Now, maybe the ratio of total sulfur dioxide to citric acid has an interesting relationship to quality. So we’ll try this.

new_df <- wine %>% mutate(total_sulfur_dioxide_over_citric_acid = total.sulfur.dioxide/citric.acid)

# quality is a factor at this point, so convert via character to plot the actual scores (3 to 9) rather than the level codes (1 to 7)
# fun replaces the deprecated fun.y argument in ggplot2 3.3.0 and later
ggplot(aes(x=as.numeric(as.character(quality)), y=total_sulfur_dioxide_over_citric_acid), data=new_df) + geom_point(color='orange', alpha = 0.5, position = position_jitter(h=0)) + scale_x_continuous() + scale_y_continuous(limits=c(0, quantile(new_df$total_sulfur_dioxide_over_citric_acid, 0.99))) + geom_line(stat = 'summary', fun = mean)

We can see that the ratio decreases as the quality increases.
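One caveat with this ratio: summary(wine) reports a minimum citric.acid of 0, and dividing by zero yields Inf in R. The sketch below shows the effect on a mean and one way to guard against it. The total.sulfur.dioxide values are taken from the str(wine) output, while the zero in the citric acid vector is inserted purely for illustration.

```r
# Four total.sulfur.dioxide values as printed by str(wine)
total_so2 <- c(170, 132, 97, 186)
# Citric acid values; the final 0 is inserted for illustration because
# summary(wine) reports a minimum of 0 for this feature
citric <- c(0.36, 0.34, 0.40, 0.00)

ratio <- total_so2 / citric      # last element is Inf
naive_mean <- mean(ratio)        # Inf, dominated by the zero denominator

# Treat ratios with a zero denominator as missing instead
safe_ratio <- ifelse(citric > 0, total_so2 / citric, NA_real_)
safe_mean <- mean(safe_ratio, na.rm = TRUE)

print(naive_mean)
print(safe_mean)
```

Whether such rows distort the plotted mean line depends on how infinite values are handled along the way, so filtering them out explicitly before summarizing is probably the safer route.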

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

It appears that all these sensory features strengthen each other: volatile acidity, citric acid, residual sugar, chlorides, and total sulfur dioxide.


Final Plots and Summary

Plot One

Fixed acidity vs citric acid

We’ll check fixed acidity vs citric acid as they seem to be related.

# scatter
ggplot(aes(x=fixed.acidity, y=citric.acid), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), alpha = 0.3, shape=21, position = position_jitter(w = 0.1, h = 0.1))  + scale_x_continuous(limits=c(3.8, quantile(wine$fixed.acidity, 0.99))) + scale_y_continuous(limits=c(0.0001, quantile(wine$citric.acid, 0.99))) + stat_smooth(method='lm')

Here we can see that citric acid increases as fixed acidity increases.

Plot Two

fixed.acidity vs pH

We’ll check the fixed.acidity vs pH as they seem to be related.

# scatter
ggplot(aes(x=fixed.acidity, y=pH), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), alpha = 0.3, shape=21, position = position_jitter(w = 0.1, h = 0.1))  + scale_x_continuous(limits=c(3.8, quantile(wine$fixed.acidity, 0.99))) + scale_y_continuous(limits=c(2.720, quantile(wine$pH, 0.99))) + stat_smooth(method='lm')
## Warning: Removed 88 rows containing missing values (stat_smooth).
## Warning: Removed 123 rows containing missing values (geom_point).

Here we see that pH decreases as fixed acidity increases.
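The warnings above are a reminder that limits set through scale_x_continuous() and scale_y_continuous() remove out-of-range rows before stat_smooth() fits its line, so the fitted line is based on the trimmed data. The sketch below imitates that effect in base R with lm() on simulated values (not the wine measurements); the variable names and the chosen trend are assumptions for illustration.

```r
set.seed(42)
# Simulated data with a negative linear trend (not the wine measurements)
x <- runif(300, min = 4, max = 14)
y <- 3.8 - 0.08 * x + rnorm(300, sd = 0.15)

# Fit on every row
fit_all <- lm(y ~ x)

# Fit after dropping rows above the 99th percentile on either axis,
# which mimics what scale limits do before the smoother runs
keep <- x <= quantile(x, 0.99) & y <= quantile(y, 0.99)
fit_trimmed <- lm(y[keep] ~ x[keep])

# Second coefficient of each fit is the slope
slopes <- c(all = coef(fit_all)[[2]], trimmed = coef(fit_trimmed)[[2]])
dropped <- sum(!keep)

print(slopes)
print(dropped)
```

The two slopes differ slightly because a few rows are excluded from the second fit. If the goal is only to zoom the plot, coord_cartesian(xlim = ..., ylim = ...) in ggplot2 changes the visible window without discarding data from the fit.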

Plot Three

density vs alcohol

# scatter
ggplot(aes(x=density, y=alcohol), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), shape=21)  + scale_x_continuous(limits=c(0.9871, 1.0390)) + scale_y_continuous(limits=c(7.9, 14.20)) + stat_smooth(method='lm')
## Warning: Removed 57 rows containing missing values (geom_path).

Here alcohol decreases as density increases.

Plot Four

chlorides vs density

# scatter
ggplot(aes(x=chlorides, y=density), data=wine) + geom_point(fill=I('#F79420'), color=I('black'), shape=21)  + scale_x_continuous(limits=c(0, quantile(wine$chlorides, 0.99))) + scale_y_continuous(limits=c(0.9861, quantile(wine$density, 0.99))) + stat_smooth(method='lm')

As chlorides increase, so does density.

Reflection

The wine data has 4,898 observations with features that describe how a wine may smell or taste. This is what we assume affects the quality (which is subjective in itself). Some preprocessing was needed to work with the data. The feature ‘quality’ was of interest but it was in numeric form instead of a factor. This feature is categorical; i.e. a wine may be labeled 0, 1, 2, …, 10 depending on the quality. It appears that good-quality wines have lower amounts of the sensory features than lower-quality wines:

volatile acidity, citric acid, residual sugar, chlorides, and total sulfur dioxide.

The description of these variables from the data site suggests that these features affect the smell and/or taste. Now, there’s evidence that the ratio of some of these features affects the quality. It appears that higher-quality wines have a lower ratio of total sulfur dioxide to citric acid.

As a follow-up to this project, I would look at ratios of other features.